SYS 6018 | Spring 2021 | University of Virginia
H. Diana McSpadden
When natural disasters or other emergencies result in compromised communications and transportation challenges, it can be impossible within a reasonable time to locate survivors using human-only methods. This disaster relief toy problem seeks to discover if an algorithm can effectively identify images corresponding to people who require relief.
To investigate whether this is possible a data set was provided containing RGB color values corresponding to pixels. The pixels are from high-resolution images taken by an aircraft above Haiti after the 2010 earthquake. Blue tarps had been distributed to survivors, but rescue workers did not have information about where survivors relocated after receiving the blue tarps.
Blue tarps have a distinct color when compared to other identifiable elements in Haiti. The training data set includes 63,241 RGB values for pixels in the toy problem’s images. The training data set includes a label for each pixel. The labels include:Below the training data is explored, and various modeling methods are applied to determine if, and which method can be effectively used to identify blue tarps by RGB values.
# Load Required Packages
library(tidyverse)
library(pROC)
library(randomForest)
library("GGally")
library(gridExtra)
library(plotly)
library(caret)
library(boot)
library(regclass)
library(MLeval)
library(ggplot2)
library(purrr)
library(broom)
library(e1071)
library(caret)
library(boot)
library(glmnet)
library(yardstick)filename = 'HaitiPixels.csv'
#url = 'https://collab.its.virginia.edu/access/lessonbuilder/item/1707832/group/17f014a1-d43d-4c78-a5c6-698a9643404f/Module3/HaitiPixels.csv' #this url is beng
haiti <- read_csv(filename)
print(dim(haiti))#> [1] 63241 4
Below are the first 6 rows of the training dataset.
head(haiti)#> # A tibble: 6 x 4
#> Class Red Green Blue
#> <chr> <dbl> <dbl> <dbl>
#> 1 Vegetation 64 67 50
#> 2 Vegetation 64 67 50
#> 3 Vegetation 64 66 49
#> 4 Vegetation 75 82 53
#> 5 Vegetation 74 82 54
#> 6 Vegetation 72 76 52
The dataframe contains 4 columns, and 63,241 rows. The Class column contains the correct label for the observation. Red, Green and Blue parameters are each values between 0 and 255 used in the additive RBG color model.
To prepare the data for exploratory data analysis I make Class a factor.
haiti %>%
mutate(Class = factor(Class)) #> # A tibble: 63,241 x 4
#> Class Red Green Blue
#> <fct> <dbl> <dbl> <dbl>
#> 1 Vegetation 64 67 50
#> 2 Vegetation 64 67 50
#> 3 Vegetation 64 66 49
#> 4 Vegetation 75 82 53
#> 5 Vegetation 74 82 54
#> 6 Vegetation 72 76 52
#> 7 Vegetation 71 72 51
#> 8 Vegetation 69 70 49
#> 9 Vegetation 68 70 49
#> 10 Vegetation 67 70 50
#> # ... with 63,231 more rows
Examine the numbers and percentages in each of the 5 classes:
haiti %>%
group_by(Class) %>%
summarize(N = n()) %>%
mutate(Perc = round(N / sum(N), 2) * 100)#> # A tibble: 5 x 3
#> Class N Perc
#> * <chr> <int> <dbl>
#> 1 Blue Tarp 2022 3
#> 2 Rooftop 9903 16
#> 3 Soil 20566 33
#> 4 Various Non-Tarp 4744 8
#> 5 Vegetation 26006 41
The records are not evenly distributed between the categories. Of the Classes, Blue Tarp, our “positive” category if we are thinking a binary positive/negative identification, is only 3% of our sample. Soil and Vegetation make up the majority of our sample at 74%.
It will be interesting to see performance predicting each of these categories, or a binary is or is not Blue Tarp.
After reviewing box plots for the 2-class data set, I also created two new calculated variables:
1. GBSqr = (Green + Blue)^2 * .001
2. RBSqr = (Red + Blue)^2 * .001
I created these to continue using the Red and Green values, but I wanted to increase the difference in median value difference between the positive and negative classes. There is significant interplay in color values between Red, Green, and Blue in identifying the correct shade or blue, and I want to continue using Red and Green values but increase the linear separability between the classes. The 0.01 multiplier is to return the number scale to a range similar to standard RGB values, i.e, a manual “scaling” of the new parameters.
haitiBinary = haiti %>%
mutate(ClassBinary = if_else(Class == 'Blue Tarp', '1', '0'), ClassBinary = factor(ClassBinary))
haitiBinarySqrs = haiti %>%
mutate(GBSqr = I(((Green + Blue)^2) * .001), RBSqr = I(((Red + Blue)^2) * .001), ClassBinary = if_else(Class == 'Blue Tarp', '1', '0'), ClassBinary = factor(ClassBinary))Re-examine the numbers and percentages in each of the 2 classes:
haitiBinary %>%
group_by(ClassBinary) %>%
summarize(N = n()) %>%
mutate(Perc = round(N / sum(N), 2) * 100)#> # A tibble: 2 x 3
#> ClassBinary N Perc
#> * <fct> <int> <dbl>
#> 1 0 61219 97
#> 2 1 2022 3
redplot <- ggplot(haiti, aes(x=Class, y=Red)) +
geom_boxplot(col='red')
greenplot <- ggplot(haiti, aes(x=Class, y=Green)) +
geom_boxplot(col='darkgreen')
blueplot <- ggplot(haiti, aes(x=Class, y=Blue)) +
geom_boxplot(col='darkblue')
grid.arrange(redplot, greenplot, blueplot)redplotB <- ggplot(haitiBinary, aes(x=ClassBinary, y=Red)) +
geom_boxplot(col='red')
greenplotB <- ggplot(haitiBinary, aes(x=ClassBinary, y=Green)) +
geom_boxplot(col='darkgreen')
blueplotB <- ggplot(haitiBinary, aes(x=ClassBinary, y=Blue)) +
geom_boxplot(col='darkblue')
grid.arrange(redplotB, greenplotB, blueplotB) ### How are red, blue and green values distributed between the 2 classes with the square values for Red + Blue and Green Blue?
redplotB <- ggplot(haitiBinarySqrs, aes(x=ClassBinary, y=RBSqr)) +
geom_boxplot(col='red')
greenplotB <- ggplot(haitiBinarySqrs, aes(x=ClassBinary, y=GBSqr)) +
geom_boxplot(col='darkgreen')
blueplotB <- ggplot(haitiBinarySqrs, aes(x=ClassBinary, y=Blue)) +
geom_boxplot(col='darkblue')
grid.arrange(redplotB, greenplotB, blueplotB)For the 5-class box plots:
“Blue Tarp” as the “positive” result, and other results as the “negative” result.
Regarding the box plot of the five categories, of interest is that “Soil” and “Vegetation” are relatively unique in their RGB distributions. “Rooftop” and “Various Non-Tarp” are more similar in their RBG distributions
For the 2-class box plots:
If the classes are collapsed to binary values of “Blue Tarp (1)” and “Not Blue Tarp (0)” there is little overlap in the blue values for the two classes, and the ranges of red and green are much smaller for blue tarp than non-blue-tarp.
Generally, the values of red have a larger range for negative results than for positive results, and the positive results have a similar median to the negative results.
Green values have a larger range for negative results than for positive results, and the positive results have a higher median than the negative results.
There is almost no overlap in the blue data with non-blue tarps, and blue tarps.
For the 2-class box plots with the additive square values:
If the classes are collapsed to binary values of “Blue Tarp (1)” and “Not Blue Tarp (0)” there is little overlap in the blue values for the two classes, and the RBSqr and GBSqr values have much less overlap than without the additive square variables.
The values of RBSqr have a larger range for negative results than for negative results, and median is significantly greater in the positive results.
GBSqr values have a larger range for negative results than for positive results. The positive results have a significantly higher median than the negative results.
There is almost no overlap in the blue data with non-blue tarps, and blue tarps.
These correlations make sense as the pixels were of highly saturated/“additive” colors. There are few pixels in the data set with low values for R,G,B.
#ggpairs(haiti, lower = list(continuous = "points", combo = "dot_no_facet"), progress = F)
ggpairs(haiti, progress = F)#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#ggpairs(haiti, lower = list(continuous = "points", combo = "dot_no_facet"), progress = F)
ggpairs(haitiBinary[-1], progress = F)#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggpairs(haitiBinarySqrs[-1], progress = F)#> Warning: Computation failed in `stat_density()`:
#> attempt to apply non-function
#> Warning: Computation failed in `stat_density()`:
#> attempt to apply non-function
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Computation failed in `stat_bin()`:
#> attempt to apply non-function
#> Warning: Computation failed in `stat_bin()`:
#> attempt to apply non-function
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Computation failed in `stat_bin()`:
#> attempt to apply non-function
#> Warning: Computation failed in `stat_bin()`:
#> attempt to apply non-function
The RBSqr and GBSqr have significantly less variance in their values, and better differentiation between the 2 classes than the Red and Green variables. I will be using these transformed variables in my models.
To view the relationship between the Red, Green, and Blue values between the five classes, and the binary classes, an interactive 3-D scatter plot is illustrative.
fiveCat3D = plot_ly(x=haiti$Red, y=haiti$Blue, z=haiti$Green, type="scatter3d", mode="markers", color=haiti$Class, colors = c('blue2','azure4','chocolate4','coral2','chartreuse4'),
marker = list(symbol = 'circle', sizemode = 'diameter', opacity =0.35))
fiveCat3D = fiveCat3D %>%
layout(title="5 Category RBG Plot", scene = list(xaxis = list(title = "Red", color="red"), yaxis = list(title = "Blue", color="blue"), zaxis = list(title = "Green", color="green")))
fiveCat3D5-Class 3-D Scatter Plot Observations
One can see that there are discernible groupings of pixel categories by RGB values. Unsurprisingly, the blue tarps are higher blue values, but they do have a range of red and green values.
The 3D scatter plot is particularly useful because, by zooming in, one can see that while the ‘Blue Tarp’ values are generally distinct, there is a space in the 3D plot with mingling of “blue tarp” pixels and other pixel categories. That area of the data will provide a challenge for our model.
binary3D = plot_ly(x=haitiBinarySqrs$RBSqr, y=haitiBinarySqrs$Blue, z=haitiBinarySqrs$GBSqr, type="scatter3d", mode="markers", color=haitiBinary$ClassBinary, colors = c('red','blue2'),
marker = list(symbol = 'circle', sizemode = 'diameter', opacity =0.35))
binary3D = binary3D %>%
layout(title="Binary RBG Plot", scene = list(xaxis = list(title = "RBSqr", color="red"), yaxis = list(title = "Blue", color="blue"), zaxis = list(title = "GBSqr", color="green")))
binary3D2-Class 3-D Scatter Plot Observations With Blue, GBSqr, and RBSqr
Similar to the five category 3D scatter plot, the binary scatter plot shows distinct groupings for blue tarp and non-blue-tarp. Visually, there is an near-distinct linear boundary between the blue tarp and non-blue tarp observations.
For logistic regression, LDA, QDA, KNN and Penalized Logistic Regression Cross-Validation threshold performance use ROC and Accuracy for tuning.
The following performance measures are collected for both the 10-fold cross-validation and the hold-out/testing/validation data:
For the Models: * No: Not a Blue Tarp is Negative * Yes: Is a Blue Tarp is Positive
In order to use R’s built-in factor function I set ClassBinary to a factor and order it “No”, “Yes”.
I also review the factor counts and create a dataframe named “train”.
levels(haitiBinarySqrs$ClassBinary)#> [1] "0" "1"
levels(haitiBinarySqrs$ClassBinary)=c("No","Yes")
levels(haitiBinarySqrs$ClassBinary)#> [1] "No" "Yes"
fct_count(haitiBinarySqrs$ClassBinary)#> # A tibble: 2 x 2
#> f n
#> <fct> <int>
#> 1 No 61219
#> 2 Yes 2022
Using 10-fold cross validation and p values in the collection (.1,.2,.3,.35,.4,.5,.6,.7,.8,.9) I tested three models:
** Model 1: Greatest Complexity:** \[ClassBinary = Blue + GBSqr + RBSqr \hspace{.3cm} | \hspace{.3cm} GBSqr = (Green + Blue)^2 \hspace{.3cm} and \hspace{.3cm} RBSqr = (Red + Blue)^2\]
** Model 2: Standard Additive Model:** \[ClassBinary = Blue + Green + Red\]
** Model 3: Simple Logistic Regression Model:** \[ClassBinary = Blue\]
library(yardstick)
set.seed(1976)
#Randomly shuffle the data
haitiBinarySqrsShuffled = haitiBinarySqrs[sample(nrow(haitiBinarySqrs)),]
#Create 10 equally size folds
folds <- cut(seq(1,nrow(haitiBinarySqrsShuffled)),breaks=10,labels=FALSE)
# create storage variables for the p value, ROC, Specificity, and Sensitivity
k_start = 1
k_end = 17
out_nvar = k_end - k_start + 1
p_i = rep(NA, out_nvar)
# complex model measures
sensitivity_c_i = rep(NA, out_nvar)
specificity_c_i = rep(NA, out_nvar)
prec_c_i = rep(NA, out_nvar)
# additive model measures
sensitivity_a_i = rep(NA, out_nvar)
specificity_a_i = rep(NA, out_nvar)
prec_a_i = rep(NA, out_nvar)
# simple model measures
sensitivity_s_i = rep(NA, out_nvar)
specificity_s_i = rep(NA, out_nvar)
prec_s_i = rep(NA, out_nvar)
counter = 1
for (j in c(.1,.2,.3,.35,.4, .5,.6,.7,.8,.9))
{
p_i[counter] = j
accumulator_c_sens = 0
accumulator_c_spec = 0
accumulator_c_prec = 0
accumulator_a_sens = 0
accumulator_a_spec = 0
accumulator_a_prec = 0
accumulator_s_sens = 0
accumulator_s_spec = 0
accumulator_s_prec = 0
#Perform 10 fold cross validation
for(i in 1:10) {
#Segement your data by fold using the which() function
testIndexes = which(folds==i,arr.ind=TRUE)
testData = haitiBinarySqrsShuffled[testIndexes, ]
trainData = haitiBinarySqrsShuffled[-testIndexes, ]
# define complex model
glm.fits.complex = glm(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, binomial, data = trainData)
# define additive model
glm.fits.additive = glm(ClassBinary ~ Blue+Green+Red, binomial, data = trainData)
# define simple model
glm.fits.simple = glm(ClassBinary ~ Blue, binomial, data = trainData)
# fit the complex model
glm.pred.complex = glm.fits.complex %>% augment(newdata=testData) %>%
dplyr::select(ClassBinary, .fitted) %>%
mutate(ClassBinary=factor(ClassBinary))%>%
mutate(.prediction=ifelse(1 - 1/(1 + exp(.fitted)) < j, "No", "Yes")) %>%
mutate(.prediction=fct_relevel(as_factor(.prediction), c("No", "Yes")))
if (nlevels(glm.pred.complex$.prediction) > 1)
{
accumulator_c_sens = accumulator_c_sens + yardstick::sens(glm.pred.complex, ClassBinary, .prediction)[[3]]
accumulator_c_spec = accumulator_c_spec + yardstick::spec(glm.pred.complex, ClassBinary, .prediction)[[3]]
accumulator_c_prec = accumulator_c_prec + yardstick::precision(glm.pred.complex, ClassBinary, .prediction)[[3]]
}
# fit the additive model
glm.pred.additive = glm.fits.additive %>% augment(newdata=testData) %>%
dplyr::select(ClassBinary, .fitted) %>%
mutate(ClassBinary=factor(ClassBinary))%>%
mutate(.prediction=ifelse(1 - 1/(1 + exp(.fitted)) < j, "No", "Yes")) %>%
mutate(.prediction=fct_relevel(as_factor(.prediction), c("No", "Yes")))
if (nlevels(glm.pred.additive$.prediction) > 1)
{
accumulator_a_sens = accumulator_a_sens + yardstick::sens(glm.pred.additive, ClassBinary, .prediction)[[3]]
accumulator_a_spec = accumulator_a_spec + yardstick::spec(glm.pred.additive, ClassBinary, .prediction)[[3]]
accumulator_a_prec = accumulator_a_prec + yardstick::precision(glm.pred.additive, ClassBinary, .prediction)[[3]]
}
# fit the simple model
glm.pred.simple= glm.fits.simple %>% augment(newdata=testData) %>%
dplyr::select(ClassBinary, .fitted) %>%
mutate(ClassBinary=factor(ClassBinary))%>%
mutate(.prediction=ifelse(1 - 1/(1 + exp(.fitted)) < j, "No", "Yes")) %>%
mutate(.prediction=fct_relevel(as_factor(.prediction), c("No", "Yes")))
if (nlevels(glm.pred.simple$.prediction) > 1)
{
accumulator_s_sens = accumulator_s_sens + yardstick::sens(glm.pred.simple, ClassBinary, .prediction)[[3]]
accumulator_s_spec = accumulator_s_spec + yardstick::spec(glm.pred.simple, ClassBinary, .prediction)[[3]]
accumulator_s_prec = accumulator_s_prec + yardstick::precision(glm.pred.simple, ClassBinary, .prediction)[[3]]
}
}
sensitivity_c_i[counter] = accumulator_c_sens / 10
specificity_c_i[counter] = accumulator_c_spec / 10
prec_c_i[counter] = accumulator_c_prec /10
sensitivity_a_i[counter] = accumulator_a_sens / 10
specificity_a_i[counter] = accumulator_a_spec / 10
prec_a_i[counter] = accumulator_a_prec /10
sensitivity_s_i[counter] = accumulator_s_sens / 10
specificity_s_i[counter] = accumulator_s_spec / 10
prec_s_i[counter] = accumulator_s_prec /10
counter = counter + 1
}
outcome = data.frame(p_i, sensitivity_c_i, specificity_c_i, prec_c_i, sensitivity_a_i, specificity_a_i, prec_a_i, sensitivity_s_i, specificity_s_i, prec_s_i)
print(outcome, n = nrow(outcome))#> Error in print.default(m, ..., quote = quote, right = right, max = max): invalid 'na.print' specification
SLR Model 3: ClassBinary = Blue
Model 3, with the square terms, performed better on all measures.
Because geographic distribution of blue tarps is not included in this data, I will focus on True Positive Rate, because I want to identify the most blue tarps that I can. I could see how in some situations, with limited personelle and resources, other measure may be of more utility. But, I will assume infinite resources, which leads me to use True Positive Rate as my measure for tuning and model selection.
Below I use the entire training data set on the model with p = 0.1 to get the training Accuracy, TPR, FPR, and Precision:
set.seed(1976)
#Randomly shuffle the data
haitiBinarySqrsShuffled = haitiBinarySqrs[sample(nrow(haitiBinarySqrs)),]
#Create 10 equally size folds
folds <- cut(seq(1,nrow(haitiBinarySqrsShuffled)),breaks=10,labels=FALSE)
out_lr.complex = tibble()
#Perform 10 fold cross validation
for(i in 1:10) {
#Segement your data by fold using the which() function
testIndexes = which(folds==i,arr.ind=TRUE)
testData = haitiBinarySqrsShuffled[testIndexes, ]
trainData = haitiBinarySqrsShuffled[-testIndexes, ]
# define complex model
glm.fits.complex = glm(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, binomial, data = trainData)
# fit the complex model
glm.pred.complex = glm.fits.complex %>% augment(newdata=testData) %>%
dplyr::select(ClassBinary, .fitted) %>%
mutate(ClassBinary=factor(ClassBinary))%>%
mutate(.prediction=ifelse(1 - 1/(1 + exp(.fitted)) < .1, "No", "Yes")) %>%
mutate(.prediction=fct_relevel(as_factor(.prediction), c("No", "Yes")))
out_lr.complex = bind_rows(out_lr.complex,
tibble(truth = testData$ClassBinary,
prediction = glm.pred.complex$.prediction,
fitted = glm.pred.complex$.fitted))
}
caret::confusionMatrix(out_lr.complex$prediction, out_lr.complex$truth)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 60987 41
#> Yes 232 1981
#>
#> Accuracy : 0.9957
#> 95% CI : (0.9951, 0.9962)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.9333
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.9962
#> Specificity : 0.9797
#> Pos Pred Value : 0.9993
#> Neg Pred Value : 0.8952
#> Prevalence : 0.9680
#> Detection Rate : 0.9644
#> Detection Prevalence : 0.9650
#> Balanced Accuracy : 0.9880
#>
#> 'Positive' Class : No
#>
For Logistic Regression, my calculations for Accuracy, TPR, FPR, and Precision treat ‘Yes’, i.e. “yes, it is a blue tarp”, as the positive class.
The following cross-validation measures were calculated with a threshold, p, of 0.1:library(ROCR)
# create a function so I can use repeatedly
function_ROC_AUC = function(fitted, truth, title)
{
##produce the numbers associated with classification table
rates = prediction(fitted, truth)
##store the true positive and false postive rates
roc_result = performance(rates, measure = "tpr", x.measure = "fpr")
par(mfrow=c(1,2))
##plot ROC curve and overlay the diagonal line for random guessing
plot(roc_result, main = title)
lines(x = c(0,1), y = c(0,1), col="red")
plot(roc_result, avg= "vertical", spread.estimate="boxplot", lwd=2,col='blue',
show.spread.at= seq(0.1, 0.9, by=0.1),
main= "Accuracy across cutoffs", xlab="Threshold cutoff")
##compute the AUC
auc = performance(rates, measure = "auc")
print(auc@y.values[[1]])
}
function_ROC_AUC(out_lr.complex$fitted, out_lr.complex$truth, "ROC For Tuned Complex Logisitic Regressions Model")#> [1] 0.9995643
The Logistic Regression ROC-AUC for the 10-fold cross-validated training data with p=0.1 is: 0.9996.
I trained the LDA model using 10-fold cross validation.
10-fold cross validation was used for p-values from 0.1 to .95. Tuning was performed using ROC.
I tested with thresholds in c(.1, .2, .3, .4, .5, .6, .7, .8, .85, .9, .95). The best performing threshold on accuracy was 0.3. For illustrative purposes I run with 0.3, and 0.4.
#y_true
set.seed(1976)
haitiBinarySqrsShuffled = haitiBinarySqrs[sample(nrow(haitiBinarySqrs)),]
#Create 10 equally size folds
folds <- cut(seq(1,nrow(haitiBinarySqrsShuffled)),breaks=10,labels=FALSE)
trctrl <- trainControl(method = "cv", summaryFunction=twoClassSummary, classProbs=T, savePredictions = T)
out_lda_p = tibble()
#Perform 10 fold cross validation
for(i in 1:10) {
#Segement your data by fold using the which() function
testIndexes = which(folds==i,arr.ind=TRUE)
testData = haitiBinarySqrsShuffled[testIndexes, ]
trainData = haitiBinarySqrsShuffled[-testIndexes, ]
train_lda <- train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = trainData, method = "lda", trControl = trctrl, tuneLength=10)
# test with the fold's test data
preds = predict(train_lda, newdata = testData, type="prob")
#- evaluate fold k
y_true = testData$ClassBinary
# check the threshold
# TO DO: ADD LOOP OVER p values of .1, .2, .3, .4, .5, .6, .7, .8, .85, .9, .95
#for (p in c(.2, .8)){
for (p in c(0.1, 0.2, 0.3, 0.4, 0.5)){
#y_hat = tibble(hat_pred = preds$Yes > p) %>% mutate(hat_value = ifelse(preds$Yes > p))
y_hat = preds %>% mutate(hat_value = factor(ifelse(Yes > p, "Yes", "No")))
#spec(two_class_example, truth, predicted)
spec = yardstick::spec(preds, y_true, y_hat$hat_value)[[3]]
#acc = apply(y_hat, 2, function(y) mean(y == y_true))
#- output
out_lda_p = bind_rows(out_lda_p, tibble(threshold = p, TPR = spec))
}
}
out_lda_p#> # A tibble: 50 x 2
#> threshold TPR
#> <dbl> <dbl>
#> 1 0.1 0.858
#> 2 0.2 0.853
#> 3 0.3 0.848
#> 4 0.4 0.843
#> 5 0.5 0.824
#> 6 0.1 0.889
#> 7 0.2 0.885
#> 8 0.3 0.881
#> 9 0.4 0.881
#> 10 0.5 0.877
#> # ... with 40 more rows
#-- Get mean accuracy
perf_lda_p = out_lda_p %>%
group_by(threshold) %>%
summarize(mean_TPR = mean(TPR))
perf_lda_p %>% arrange(-mean_TPR)#> # A tibble: 5 x 2
#> threshold mean_TPR
#> <dbl> <dbl>
#> 1 0.1 0.882
#> 2 0.2 0.875
#> 3 0.3 0.859
#> 4 0.4 0.852
#> 5 0.5 0.846
The threshold with the greatest true positive rate using a LDA model uses a probability threshold of 0.1.
# tibble to save all predictions at the best threshold
out_best_lda = tibble()
out_all_preds = tibble()
#Perform 10 fold cross validation
for(i in 1:10) {
#Segement your data by fold using the which() function
testIndexes = which(folds==i,arr.ind=TRUE)
testData = haitiBinarySqrsShuffled[testIndexes, ]
trainData = haitiBinarySqrsShuffled[-testIndexes, ]
train_lda <- train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = trainData, method = "lda", trControl = trctrl, tuneLength=10)
# test with the fold's test data
preds = predict(train_lda, newdata = testData, type="prob")
#- evaluate fold k
y_true = testData$ClassBinary
y_hat = preds %>% mutate(hat_value = ifelse(Yes > 0.1, "Yes", "No")) %>% select(hat_value)
#- output
out_best_lda = bind_rows(out_best_lda, tibble(pred = y_hat$hat_value, truth = testData$ClassBinary))
out_all_preds = bind_rows(out_all_preds, tibble(Yes = preds$Yes))
}out_best_lda = out_best_lda %>% mutate(pred = factor(pred), truth = factor(truth))
caret::confusionMatrix(out_best_lda$pred, out_best_lda$truth)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 61153 284
#> Yes 66 1738
#>
#> Accuracy : 0.9945
#> 95% CI : (0.9939, 0.995)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.9057
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.9989
#> Specificity : 0.8595
#> Pos Pred Value : 0.9954
#> Neg Pred Value : 0.9634
#> Prevalence : 0.9680
#> Detection Rate : 0.9670
#> Detection Prevalence : 0.9715
#> Balanced Accuracy : 0.9292
#>
#> 'Positive' Class : No
#>
For LDA, my calculations for TPR, FPR, and Precision treat ‘Yes’, i.e. “yes, it is a blue tarp”, as the positive class.
function_ROC_AUC(out_all_preds$Yes, out_best_lda$truth, "ROC For Tuned LDA Model")#> [1] 0.9944908
The LDA AUC based on 10-fold cross-validated training data with a threshold of 0.3 is: 0.99.
I trained the QDA model using 10-fold cross validation.
10-fold cross validation was used and various thresholds were evaluated for best true positive rate.
0.1 produced the highest accuracy. For illustrative purposes I only test thresholds of 0.1, 0.2, an 0.3.
set.seed(1976)
# I will use the same folds calculated in the LDA code#
trctrl <- trainControl(method = "cv", summaryFunction=twoClassSummary, classProbs=T, savePredictions = T)
out_qda_p = tibble()
# TO DO: ADD LOOP OVER p values of .1, .2, .3, .4, .5, .6, .7, .8, .85, .9, .95
#for (p in c(.3, .4)){
#Perform 10 fold cross validation
for(i in 1:10) {
#Segement your data by fold using the which() function
testIndexes = which(folds==i,arr.ind=TRUE)
testData = haitiBinarySqrsShuffled[testIndexes, ]
trainData = haitiBinarySqrsShuffled[-testIndexes, ]
train_qda <- train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = trainData, method = "qda", trControl = trctrl, tuneLength=10)
# test with the fold's test data
preds = predict(train_qda, newdata = testData, type="prob")
#- evaluate fold k
#for (p in c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)){
for (p in c(0.1, 0.2, 0.3)){
y_true = testData$ClassBinary
# check the threshold
y_hat = preds %>% mutate(hat_value = factor(ifelse(Yes > p, "Yes", "No")))
#spec(two_class_example, truth, predicted)
spec = yardstick::spec(preds, y_true, y_hat$hat_value)[[3]]
#acc = apply(y_hat, 2, function(y) mean(y == y_true))
#- output
out_qda_p = bind_rows(out_qda_p, tibble(threshold = p, TPR = spec))
}
}#-- Get mean accuracy
perf_qda_p = out_qda_p %>%
group_by(threshold) %>%
summarize(mean_TPR = mean(TPR))
perf_qda_p %>% arrange(-mean_TPR)#> # A tibble: 3 x 2
#> threshold mean_TPR
#> <dbl> <dbl>
#> 1 0.1 0.925
#> 2 0.2 0.914
#> 3 0.3 0.906
Based on the highest True Positive Rate of cross validation of the training data, I am selecting a threshold of 0.1 for the QDA model.
# tibble to save all predictions at the best threshold
out_best_qda = tibble()
out_all_preds_qda = tibble()
#Perform 10 fold cross validation
for(i in 1:10) {
#Segement your data by fold using the which() function
testIndexes = which(folds==i,arr.ind=TRUE)
testData = haitiBinarySqrsShuffled[testIndexes, ]
trainData = haitiBinarySqrsShuffled[-testIndexes, ]
train_qda <- train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = trainData, method = "qda", trControl = trctrl, tuneLength=10)
# test with the fold's test data
preds = predict(train_qda, newdata = testData, type="prob")
#- evaluate fold k
y_true = testData$ClassBinary
y_hat = preds %>% mutate(hat_value = ifelse(Yes > 0.1, "Yes", "No")) %>% select(hat_value)
#- output
out_best_qda = bind_rows(out_best_qda, tibble(pred = y_hat$hat_value, truth = testData$ClassBinary))
out_all_preds_qda = bind_rows(out_all_preds_qda, tibble(Yes = preds$Yes))
}out_best_qda = out_best_qda %>% mutate(pred = factor(pred), truth = factor(truth))
caret::confusionMatrix(out_best_qda$pred, out_best_qda$truth)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 61149 254
#> Yes 70 1768
#>
#> Accuracy : 0.9949
#> 95% CI : (0.9943, 0.9954)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.9134
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.9989
#> Specificity : 0.8744
#> Pos Pred Value : 0.9959
#> Neg Pred Value : 0.9619
#> Prevalence : 0.9680
#> Detection Rate : 0.9669
#> Detection Prevalence : 0.9709
#> Balanced Accuracy : 0.9366
#>
#> 'Positive' Class : No
#>
For QDA, my calculations for TPR, FPR, and Precision treat ‘Yes’, i.e. “yes, it is a blue tarp”, as the positive class.
function_ROC_AUC(out_all_preds_qda$Yes, out_best_qda$truth, "ROC For Tuned QDA Model")#> [1] 0.997386
The QDA AUC based on 10-fold cross-validated training data with a threshold of 0.1 is: 0.997.
After testing over ranges from 1 to 51, the best k was 27. The code used is shown below, but not evaluated.
set.seed(1976)
trControl.knn <- trainControl(method = "repeatedcv", summaryFunction=twoClassSummary, classProbs=T, savePredictions = T)
knn.cv.model = train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = train, method = "knn", trControl=trControl.knn, tuneGrid = expand.grid(k = 26:28))
knn.cv.modelset.seed(1976)
knn.cv.model = train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = train, method = "knn", trControl=trctrl, tuneGrid = expand.grid(k = 17:35))
knn.cv.modelI order to speed up this Rmd file I am no longer evaluating the following code chunk that evaluated k = 27 - 51. When this chunk is run, the best k was still 27.
set.seed(1976)
knn.cv.model = train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = train, method = "knn", trControl=trctrl, tuneGrid = expand.grid(k = 27:51))
knn.cv.model
From 1 - 51, the best k is 27. The tables of ROC, Sensitivity and Specificity were reviewed for each cross-validation training. From these tables one can see that the improvements within the range are in the hundredths of a percentage point of ROC, so k’s in the range of 10 - 51, appear reasonable selections for the cross-validated training data.
Running manual KNN cross validation to tune for p took significant time for my laptop to process. I ran with smaller lists of p just to get a feel for the results.
The threshold 0.1 returned the same true positive rate with k == 27.
For processing speed time, I commented out the extending loop of threshold values, and limited the list of 0.1 and 0.2 for illustrative purposes.
library(class)
out_knn_p = tibble()
set.seed(1976)
haitiBinarySqrsShuffled = haitiBinarySqrsShuffled %>% mutate(ClassBinaryTF = factor(if_else(ClassBinary == "No", FALSE, TRUE)))
for (j in 1:10)
{
#Segement your data by fold using the which() function
testIndexes = which(folds==j,arr.ind=TRUE)
testData = haitiBinarySqrsShuffled[testIndexes, ]
trainData = haitiBinarySqrsShuffled[-testIndexes, ]
train_knn <- train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = trainData, method = "knn", tuneGrid = data.frame(k=27))
# test with the fold's test data
preds = predict(train_knn, newdata = testData, type="prob")
#for (p in c(.1,.2,.3,.4,.5))
for (p in c(.1,.2))
{
#- evaluate fold k
y_true = testData$ClassBinaryTF
# set the threshold to p
thres = p
y_hat = preds %>% mutate(hat_value = (Yes > thres), hat_value = factor(hat_value))
#spec(two_class_example, truth, predicted)
spec = yardstick::spec(preds, y_true, y_hat$hat_value)[[3]]
#- output
out_knn_p = bind_rows(out_knn_p, tibble(threshold = p, TPR = spec))
}
}
#out_knn_p#-- Get mean accuracy
perf_knn_p = out_knn_p %>%
group_by(threshold) %>%
summarize(mean_TPR = mean(TPR))
perf_knn_p %>% arrange(-mean_TPR)#> # A tibble: 2 x 2
#> threshold mean_TPR
#> <dbl> <dbl>
#> 1 0.1 0.987
#> 2 0.2 0.984
After testing the following values of p: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, p == 0.1 is the value that produced the highest true positive rate.
out_knn_best_thres = tibble()
for (j in 1:10)
{
#Segement your data by fold using the which() function
testIndexes = which(folds==j,arr.ind=TRUE)
testData = haitiBinarySqrsShuffled[testIndexes, ]
trainData = haitiBinarySqrsShuffled[-testIndexes, ]
train_knn_best_thres <- train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = trainData, method = "knn", tuneGrid = data.frame(k=27))
# test with the fold's test data
preds_best_thres = predict(train_knn_best_thres, newdata = testData, type="prob")
#- evaluate fold k
y_true = testData$ClassBinaryTF
y_hat = preds_best_thres %>% mutate(hat_value = (Yes > 0.1), hat_value = factor(hat_value))
out_knn_best_thres = bind_rows(out_knn_best_thres,
tibble(y_true = y_true, pred_Yes = y_hat$Yes, pred_No = y_hat$No, y_hat = y_hat$hat_value))
}out_knn_best_thres = out_knn_best_thres %>%
mutate(y_true = factor(y_true), y_hat = factor(y_hat))
caret::confusionMatrix(out_knn_best_thres$y_hat, out_knn_best_thres$y_true)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction FALSE TRUE
#> FALSE 60918 27
#> TRUE 301 1995
#>
#> Accuracy : 0.9948
#> 95% CI : (0.9942, 0.9954)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.9214
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.9951
#> Specificity : 0.9866
#> Pos Pred Value : 0.9996
#> Neg Pred Value : 0.8689
#> Prevalence : 0.9680
#> Detection Rate : 0.9633
#> Detection Prevalence : 0.9637
#> Balanced Accuracy : 0.9909
#>
#> 'Positive' Class : FALSE
#>
For KNN, my calculations for TPR, FPR, and Precision treat ‘Yes’, i.e. “yes, it is a blue tarp”, as the positive class.
function_ROC_AUC(out_knn_best_thres$pred_Yes, out_knn_best_thres$y_true, "ROC For Tuned KNN | k = 27 Model")#> [1] 0.9997087
Subset selection using glmnet/ElasticNet provides an opportunity for me to see if additional predictors are of use improving the precision of our model.
I added additional terms to the model. I selected additional additive relations between the colors because RGB color is, by design, an additive color model. The interplay between Red, Blue, and Green are intuitively significant because the combinations of these values create the visible spectrum of color.
The predictors in the ElasticNet model are:haitiBinaryFull = haitiBinarySqrs %>%
mutate(RGSqr = I(((Red + Green)^2)), RGBSqr = I(((Red + Green + Blue)^2))) head(haitiBinaryFull)#> # A tibble: 6 x 9
#> Class Red Green Blue GBSqr RBSqr ClassBinary RGSqr RGBSqr
#> <chr> <dbl> <dbl> <dbl> <I<dbl>> <I<dbl>> <fct> <I<dbl>> <I<dbl>>
#> 1 Vegetation 64 67 50 13.7 13.0 No 17161 32761
#> 2 Vegetation 64 67 50 13.7 13.0 No 17161 32761
#> 3 Vegetation 64 66 49 13.2 12.8 No 16900 32041
#> 4 Vegetation 75 82 53 18.2 16.4 No 24649 44100
#> 5 Vegetation 74 82 54 18.5 16.4 No 24336 44100
#> 6 Vegetation 72 76 52 16.4 15.4 No 21904 40000
I did not scale the predictors because the glmnet library will scale to standard deviation units. For the glmnet library, the training data must be in matrix format:
frmla = as.formula(ClassBinary~Red+Green+Blue+GBSqr+RBSqr+RGSqr+RGBSqr)
x.haitiFull.train = model.matrix(frmla, data = haitiBinaryFull)[,-1] # removing the intercept term from the formula
# switch ClassBinary to TRUE FALSE to facilitate the next bit of code
y.haitiFull.train = (haitiBinaryFull %>% mutate(ClassBinary = ifelse(ClassBinary == 'No', FALSE, TRUE)))$ClassBinaryUsing penalized logistic regression (PLR), I evaluated three different PLR methods: ridge, lasso, and elasticnet (a combination of ridge and lasso). I tuned both lambda and the probability threshold (p). Based on best testing of the threshold tuning parameter, I considered p values of .1, .2, .3, .4, .5, .6, .7, .8, .85, .9, .95.
For the sake of processing speed I commented out the loop for the extended p values (0.2 was the best threshold when I considered the extended list), and kept 0.1, 0.2, 0.3 for illustrative purposes.
After testing ridge, elasticnet, and lasso, lasso performed the best. For speed of processing, I am only demonstrating the loop with a == 1 == lasso.
Accuracy was used for tuning parameter and threshold tuning:
library(glmnet)
set.seed(1976)
# number of folds
K = 10
# make folds
folds = rep(1:K, length=nrow(x.haitiFull.train))
out = tibble()
# LOOP FOR alpha: 0 (ridge), 0.5 (elasticnet), and 1 (lasso)
for (a in c(1)){
# lambda may be different for the different PLR methods, so this is decided within the alpha loop
#-- Get lambda sequence so consistent over all folds
lam_seq = glmnet(x.haitiFull.train, y.haitiFull.train, family="binomial", alpha=a)$lambda
# TO DO: ADD LOOP OVER p values of .1, .2, .3, .4, .5, .6, .7, .8, .85, .9, .95
#for (p in c(.1, .2, .3, .4, .5, .6, .7, .8, .85, .9, .95)){
for (p in c(0.1,0.2,0.3)){
#-- Loop over K folds
for(k in unique(folds)){
#- Get train/test split for fold k
ind.train = which(folds != k)
ind.test = which(folds == k)
#- fit the alpha model over all lambdas in lam_seq
fit.model = glmnet(x.haitiFull.train[ind.train,], y.haitiFull.train[ind.train],
alpha = a, # use the alpha for the loop
family="binomial", # logistic regression
lambda = lam_seq) # all models in this alpha loop use same lambda sequence
#- make predictions on test set fold k
pred = predict(fit.model, x.haitiFull.train[ind.test, ], s=lam_seq, type = "response")
#- evaluate fold k
y_true = y.haitiFull.train[ind.test]
# set the threshold to p
y_hat = pred > p
acc = apply(y_hat, 2, function(y) mean(y == y_true))
#- output
out = bind_rows(out,
tibble(accuracy = acc,
lambda = lam_seq,
n_train = length(ind.train),
n_test = length(ind.test),
alpha = a,
thres = p,
k = k))
}
}
}#head(out)
#-- Get mean TPR by alpha, threshold, and lambda for 10 fold cross validation
perf = out %>%
group_by(alpha, thres, lambda) %>%
summarize(mean_accuracy = mean(accuracy), se_accuracy = sd(accuracy)/k) #> `summarise()` has grouped output by 'alpha', 'thres', 'lambda'. You can override using the `.groups` argument.
head(perf %>% arrange(-mean_accuracy) %>% top_n(10))#> Selecting by se_accuracy
#> # A tibble: 6 x 5
#> # Groups: alpha, thres, lambda [1]
#> alpha thres lambda mean_accuracy se_accuracy
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.2 0.0000563 0.996 0.000467
#> 2 1 0.2 0.0000563 0.996 0.000233
#> 3 1 0.2 0.0000563 0.996 0.000156
#> 4 1 0.2 0.0000563 0.996 0.000117
#> 5 1 0.2 0.0000563 0.996 0.0000933
#> 6 1 0.2 0.0000563 0.996 0.0000778
Based on accuracy the best threshold == 0.2, lasso (alpha == 1) and lambda == 1.530602e-05 performed the best.
From the manual cross-validation testing of accuracy the PLR lasso method with lambda of 5.630136e-05 and a threshold of 0.2 produced the greatest mean accuracy: 0.996.
out_confusion = tibble()
#-- Loop over K folds
for(k in unique(folds)){
#- Get train/test split for fold k
ind.train = which(folds != k)
ind.test = which(folds == k)
#- fit the lasso
cv.glmnet.model = glmnet(x.haitiFull.train[ind.train,], y.haitiFull.train[ind.train],
alpha = 1,
family="binomial", # logistic regression
lambda = lam_seq)
#- make predictions on test set fold k
pred = predict(cv.glmnet.model, x.haitiFull.train[ind.test, ], s=0.00005630136, type = "response")
#- evaluate fold k
y_true = y.haitiFull.train[ind.test]
# set the threshold
y_hat = pred > 0.2
#- output
out_confusion = bind_rows(out_confusion,
tibble(truth = y_true,
glmnet.fitted = y_hat))
}The optimal lambda:
coef(cv.glmnet.model, 0.00005630136)#> 8 x 1 sparse Matrix of class "dgCMatrix"
#> 1
#> (Intercept) -9.9118040564
#> Red -0.1218057899
#> Green -0.0651983875
#> Blue 0.3230103224
#> GBSqr .
#> RBSqr .
#> RGSqr -0.0001042983
#> RGBSqr .
This resulting model is interesting. Cross-validated lasso, with p = 0.2 set the coefficients for GBSqr, RBSqr and RGBSqr to 0. This could be due to collinearity with Blue, or because the variables are truly not significant to the accuracy of the model. This did determine that the new variable RGSqr was significant.
out_confusion %>%
mutate(truth = factor(truth), glmnet.fitted = factor(glmnet.fitted)) %>%
conf_mat(truth, glmnet.fitted)#> Truth
#> Prediction FALSE TRUE
#> FALSE 61046 81
#> TRUE 173 1941
assess.glmnet(pred, newy = y.haitiFull.train[ind.test], family="binomial")#> $deviance
#> 1
#> 1.367596
#> attr(,"measure")
#> [1] "Binomial Deviance"
#>
#> $class
#> 1
#> 0.9677419
#> attr(,"measure")
#> [1] "Misclassification Error"
#>
#> $auc
#> [1] 0.9996217
#> attr(,"measure")
#> [1] "AUC"
#>
#> $mse
#> 1
#> 0.4912767
#> attr(,"measure")
#> [1] "Mean-Squared Error"
#>
#> $mae
#> 1
#> 0.9881718
#> attr(,"measure")
#> [1] "Mean Absolute Error"
AUC for the lasso PLR model with a lambda of 0.00005630136 is 0.9996.
Using lasso penalized logistic regression with a threshold = 0.2, lambda = 0.00005630136:I was curious to examine the performance of the model selected by lasso in a logistic regression model with a threshold also selected by lasso (0.2). This is the selected PLR model without the lambda penalty term.
set.seed(1976)
#Randomly shuffle the data
haitiBinarySqrsFullShuffled = haitiBinaryFull[sample(nrow(haitiBinaryFull)),]
#Create 10 equally size folds
folds <- cut(seq(1,nrow(haitiBinarySqrsFullShuffled)),breaks=10,labels=FALSE)
out_lr.full = tibble()
#Perform 10 fold cross validation
for(i in 1:10) {
#Segement your data by fold using the which() function
testIndexes = which(folds==i,arr.ind=TRUE)
testData = haitiBinarySqrsFullShuffled[testIndexes, ]
trainData = haitiBinarySqrsFullShuffled[-testIndexes, ]
# define complex model
glm.fits.full = glm(ClassBinary ~ Blue+Green+Red+RGSqr, binomial, data = trainData)
# fit the glmnet lasso model
glm.pred.full = glm.fits.full %>% augment(newdata=testData) %>%
dplyr::select(ClassBinary, .fitted) %>%
mutate(ClassBinary=factor(ClassBinary))%>%
mutate(.prediction=ifelse(1 - 1/(1 + exp(.fitted)) < .2, "No", "Yes")) %>%
mutate(.prediction=fct_relevel(as_factor(.prediction), c("No", "Yes")))
out_lr.full = bind_rows(out_lr.full,
tibble(truth = testData$ClassBinary,
prediction = glm.pred.full$.prediction,
fitted = glm.pred.full$.fitted))
}
caret::confusionMatrix(out_lr.full$prediction, out_lr.full$truth)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 61039 73
#> Yes 180 1949
#>
#> Accuracy : 0.996
#> 95% CI : (0.9955, 0.9965)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.937
#>
#> Mcnemar's Test P-Value : 2.662e-11
#>
#> Sensitivity : 0.9971
#> Specificity : 0.9639
#> Pos Pred Value : 0.9988
#> Neg Pred Value : 0.9155
#> Prevalence : 0.9680
#> Detection Rate : 0.9652
#> Detection Prevalence : 0.9663
#> Balanced Accuracy : 0.9805
#>
#> 'Positive' Class : No
#>
function_ROC_AUC(out_lr.full$fitted, out_lr.full$truth, "ROC For Tuned Lasso Model")#> Error in prediction(fitted, truth): could not find function "prediction"
Using logistic regression with a threshold = 0.2 and the model selected by lasso:
It was interesting that this model performed slightly better on Precision, and FPR, but not as well based on other measures as the logistic regression model Red + Green + Blue + RBSqr + GBSqr.
haitiBinaryRF = haitiBinarySqrs %>%
mutate(RGSqr = I(((Red + Green)^2)), RGBSqr = I(((Red + Green + Blue)^2)), RG = (Red * Green), RB = (Red * Blue), GB = (Green * Blue))
head(haitiBinaryRF)#> # A tibble: 6 x 12
#> Class Red Green Blue GBSqr RBSqr ClassBinary RGSqr RGBSqr RG RB GB
#> <chr> <dbl> <dbl> <dbl> <I<d> <I<d> <fct> <I<d> <I<db> <dbl> <dbl> <dbl>
#> 1 Vege~ 64 67 50 13.7 13.0 No 17161 32761 4288 3200 3350
#> 2 Vege~ 64 67 50 13.7 13.0 No 17161 32761 4288 3200 3350
#> 3 Vege~ 64 66 49 13.2 12.8 No 16900 32041 4224 3136 3234
#> 4 Vege~ 75 82 53 18.2 16.4 No 24649 44100 6150 3975 4346
#> 5 Vege~ 74 82 54 18.5 16.4 No 24336 44100 6068 3996 4428
#> 6 Vege~ 72 76 52 16.4 15.4 No 21904 40000 5472 3744 3952
The first random forest tuning parameter I will test for is number of trees and mtry. I will test ntree values by cross validation.
I did run the ntree loop for 250, 500, and 1000 trees. The best performing number of trees was 1000.
I tuned with a list of mtry values of 3,4,5,6,7,8,9 and 10. 4 was the best performing. In order to lessen the amount of timed needed to knit this Rmd I am not evaluating the following code block.
control = trainControl(method = 'cv', number = 10, search = 'grid')
# possible mtry values
tunegrid = expand.grid(.mtry = c(3,4,5,6,7,8,9,10))
modellist = list()
#train with different ntree parameters
#for (ntree in c(250,500,1000)){
for (ntree in c(1000)){
set.seed(123)
rf.fit = train(ClassBinary~Red+Blue+Green+GBSqr+RBSqr+RGSqr+RGBSqr+RG+RB+GB,
data = haitiBinaryRF,
method = 'rf',
metric = 'Accuracy',
tuneGrid = tunegrid,
trControl = control,
ntree = ntree)
modellist[[toString(ntree)]] = rf.fit
}
# look at the results of the cross validation with different ntree and mtry values
results = resamples(modellist)
summary(results)rf.fit$bestTunesummary(rf.fit$finalModel$ntree)as.data.frame(rf.fit$finalModel$importance) %>% arrange(MeanDecreaseGini)The above code took such a long time to run, I am not longer evaluating in my Rmd. Here are screen shots of the output to support my further evaluation of the selected random forest model:
RF Tuning Results
RF MTRY Results
RF Importance Results
Next, I need to test for the best probability threshold using the tuned tree with number of predictors considered at each split == 4, and number of trees = 1000. I will use accuracy for selecting the best p value.
set.seed(1976)
#Randomly shuffle the data
haitiRFShuffled = haitiBinaryRF[sample(nrow(haitiBinarySqrs)),]
#Create 10 equally size folds
folds <- cut(seq(1,nrow(haitiRFShuffled)),breaks=10,labels=FALSE)
head(haitiRFShuffled)#> # A tibble: 6 x 12
#> Class Red Green Blue GBSqr RBSqr ClassBinary RGSqr RGBSqr RG RB
#> <chr> <dbl> <dbl> <dbl> <I<db> <I<db> <fct> <I<dbl> <I<db> <dbl> <dbl>
#> 1 Vegeta~ 75 66 63 16.6 19.0 No 19881 41616 4950 4725
#> 2 Rooftop 214 207 181 151. 156. No 177241 362404 44298 38734
#> 3 Rooftop 145 131 111 58.6 65.5 No 76176 149769 18995 16095
#> 4 Vegeta~ 90 96 71 27.9 25.9 No 34596 66049 8640 6390
#> 5 Rooftop 167 137 117 64.5 80.7 No 92416 177241 22879 19539
#> 6 Vegeta~ 105 111 79 36.1 33.9 No 46656 87025 11655 8295
#> # ... with 1 more variable: GB <dbl>
I did run this loop with threshold 0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9 and 0.1 performed the best using true positive rate.
For the sake of faster compilation time, I am limited the threshold loop to 0.1, 0.2, 0.3, 0.4, 0.5, 0.6.
# create storage variables for the p value, Accuracy, ROC, Specificity, and Sensitivity
all_out_yhat_ytrue = tibble()
out_yhat_ytrue = tibble()
out_rf = tibble()
# reset the fold accumulator tibble
out_yhat_ytrue = tibble()
#Perform 10 fold cross validation
for(i in 1:10) {
#Segment data by fold using the which() function
testIndexes = which(folds==i,arr.ind=TRUE)
testData = haitiRFShuffled[testIndexes, ]
trainingData = haitiRFShuffled[-testIndexes, ]
#small training and test sets
# comment this out to get real values
#testData = testData[1:100,]
#trainingData= head(trainingData, 500)
#end of small training and test sets
# training
rf.haiti = randomForest(ClassBinary~Red+Blue+Green+GBSqr+RBSqr+RGSqr+RGBSqr+RG+RB+GB, data = trainingData, mtry=4, ntree=1000)
# test
rf.preds = predict(rf.haiti, newdata = testData, type="prob")
#- evaluate fold
y_true = testData %>% mutate(ClassBinary = (ClassBinary == "Yes"), Class0or1 = ifelse(ClassBinary == FALSE,0,1))
y_true_val = y_true$ClassBinary
#y_true
#for (j in c(.1, .2,.3,.4,.5,.6,.7,.8,.9))
for (j in c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6)) {
# set the threshold to p
y_hat = rf.preds[,2] > j
#y_hat
#- output
out_yhat_ytrue = bind_rows(out_yhat_ytrue,
tibble(thres = j, true = y_true_val, hat = y_hat, truth0or1 = y_true$Class0or1, X0 = rf.preds[,1], X1 = rf.preds[,2]))
all_out_yhat_ytrue = bind_rows(all_out_yhat_ytrue,
tibble(thres = j, true = y_true_val, hat = y_hat))
acc = sum(out_yhat_ytrue$hat == out_yhat_ytrue$true) / nrow(out_yhat_ytrue)
out_rf = bind_rows(out_rf,
tibble(accuracy = acc, thres = j))
}
}Below, I examine the best threshold tested by both true positive rate and accuracy.
#head(out_yhat_ytrue)
#head(all_out_yhat_ytrue)
all_out_yhat_ytrue %>%
mutate(true = factor(true), hat=(factor(hat))) %>%
group_by(thres) %>%
mutate(TPR = yardstick::spec_vec(true, hat)) %>%
group_by(thres) %>%
summarize_at(vars(TPR), list(mean_TPR=mean)) %>%
arrange(-mean_TPR)#> # A tibble: 6 x 2
#> thres mean_TPR
#> <dbl> <dbl>
#> 1 0.1 0.985
#> 2 0.2 0.978
#> 3 0.3 0.970
#> 4 0.4 0.959
#> 5 0.5 0.947
#> 6 0.6 0.935
#head(out_rf)
#head(all_out_yhat_ytrue)
out_rf %>%
group_by(thres) %>%
summarize_at(vars(accuracy), list(mean_acc=mean))%>%
arrange(-mean_acc)#> # A tibble: 6 x 2
#> thres mean_acc
#> <dbl> <dbl>
#> 1 0.6 0.997
#> 2 0.5 0.997
#> 3 0.4 0.997
#> 4 0.3 0.997
#> 5 0.2 0.997
#> 6 0.1 0.996
Random forest cross validation with:
These tuning parameters produced the best accuracy.
Next, I produce the confusion matrix and ROC curve
# filter only the true and hat values where thres == 0.3
best_rf_df = all_out_yhat_ytrue %>%
mutate(factortrue = factor(true), factorhat = factor(hat)) %>%
filter(thres == 0.1)
head(best_rf_df)#> # A tibble: 6 x 5
#> thres true hat factortrue factorhat
#> <dbl> <lgl> <lgl> <fct> <fct>
#> 1 0.1 FALSE FALSE FALSE FALSE
#> 2 0.1 FALSE FALSE FALSE FALSE
#> 3 0.1 FALSE FALSE FALSE FALSE
#> 4 0.1 FALSE FALSE FALSE FALSE
#> 5 0.1 FALSE FALSE FALSE FALSE
#> 6 0.1 FALSE FALSE FALSE FALSE
caret::confusionMatrix(best_rf_df$factorhat, best_rf_df$factortrue)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction FALSE TRUE
#> FALSE 60954 30
#> TRUE 265 1992
#>
#> Accuracy : 0.9953
#> 95% CI : (0.9948, 0.9959)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.9287
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.9957
#> Specificity : 0.9852
#> Pos Pred Value : 0.9995
#> Neg Pred Value : 0.8826
#> Prevalence : 0.9680
#> Detection Rate : 0.9638
#> Detection Prevalence : 0.9643
#> Balanced Accuracy : 0.9904
#>
#> 'Positive' Class : FALSE
#>
Produce the ROC and calculate the AUC for the tuned random forest model:
function_ROC_AUC(out_yhat_ytrue$X1, out_yhat_ytrue$truth0or1, "ROC For Tuned RF Model")#> [1] 0.9942561
mtry and ntree were tuned using accuracy, threshold was tuned by true positive rate.
Using random forests with a threshold = 0.1, mtry == 4, and ntree == 1000 the model produced the following results:
Based on the 3D visualization of the dataset I generated earlier, I believe the data will separate well with a linear kernel. Another reason I only investigate the linear kernel is the poor performance of the QDA model compared to the LDA model.
Using cross validation, I will tune a linear kernel with cost values that include 0.001, 0.01, 0.1, 1, 10, and 100.
I ran the svm with possible costs including 0.01, 0.1, 1, and 10 and probability thresholds .2, .3, .4, .5, .6, .7, and .8. The best performing model based on TPR uses cost == 1 and threshold == 0.2. For speed of knitting this Rmd I only include those values in the code below:
out_svm = tibble()
out_svm_accuracy = tibble()
#Perform 10 fold cross validation with various costs
#Create 10 equally size folds
set.seed(1976)
folds <- cut(seq(1,nrow(haitiRFShuffled)),breaks=10,labels=FALSE)
#for (cost_val in c( 0.01, 0.1, 1, 10))
for (cost_val in c(0.1,1,10))
{
for(i in 1:10)
{
# Use my previously generated folds
#Segement your data by fold using the which() function
testIndexes = which(folds==i,arr.ind=TRUE)
testData = haitiRFShuffled[testIndexes, ]
trainData = haitiRFShuffled[-testIndexes, ]
svm_model = tune(svm, ClassBinary~Red+Blue+Green+GBSqr+RBSqr+RGSqr+RGBSqr+RG+RB+GB, data=trainData, kernel="linear", ranges=list(cost=c(cost_val)), probability=TRUE)
pred = predict(svm_model$best.model, newdata = testData, probability=TRUE)
# to speed this up, loop through the preds after the svm has been fit.
#for (thres in c(.2,.3,.4,.5,.6,.7,.8))
for (thres in c(0.1,0.2,0.3,0.4,0.5,0.6))
{
y_hat = attr(pred, "probabilities")[,"Yes"] > thres
acc = sum(y_hat == (testData$ClassBinary == "1")) / nrow(testData)
out_svm = bind_rows(out_svm,
tibble(thres = thres, cost = cost_val, pred_value = y_hat, truth_value = testData$ClassBinary, accuracy = acc, noProb = attr(pred, "probabilities")[,"No"], yesProb = attr(pred, "probabilities")[,"Yes"]))
}
}
}#head(out_yhat_ytrue)
#head(all_out_yhat_ytrue)
out_svm %>% mutate(truth_value = (truth_value == 'Yes')) %>%
mutate(true = factor(truth_value), hat=(factor(pred_value))) %>%
group_by(thres, cost) %>%
mutate(TPR = yardstick::spec_vec(true, hat)) %>%
group_by(thres, cost) %>%
summarize_at(vars(TPR), list(mean_TPR=mean)) %>%
arrange(-mean_TPR)#> # A tibble: 18 x 3
#> # Groups: thres [6]
#> thres cost mean_TPR
#> <dbl> <dbl> <dbl>
#> 1 0.1 1 0.975
#> 2 0.1 0.1 0.975
#> 3 0.1 10 0.974
#> 4 0.2 0.1 0.964
#> 5 0.2 1 0.963
#> 6 0.2 10 0.958
#> 7 0.3 0.1 0.946
#> 8 0.3 1 0.945
#> 9 0.3 10 0.940
#> 10 0.4 0.1 0.930
#> 11 0.4 1 0.927
#> 12 0.4 10 0.925
#> 13 0.5 0.1 0.913
#> 14 0.5 1 0.909
#> 15 0.5 10 0.908
#> 16 0.6 0.1 0.898
#> 17 0.6 1 0.897
#> 18 0.6 10 0.893
The highest performing, based on true positive rate cross-validated svm used a cost of 1 and a threshold of 0.1.
best_svm = out_svm %>% filter(thres == 0.1 & cost == 1.00) %>% mutate(truth_value = (truth_value == "Yes")) %>% mutate(pred_value = factor(pred_value), truth_value = factor(truth_value))
head(best_svm)#> # A tibble: 6 x 7
#> thres cost pred_value truth_value accuracy noProb yesProb
#> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 0.1 1 FALSE FALSE 0.965 0.989 0.0114
#> 2 0.1 1 FALSE FALSE 0.965 1.00 0.000000725
#> 3 0.1 1 FALSE FALSE 0.965 1.00 0.0000216
#> 4 0.1 1 FALSE FALSE 0.965 1.00 0.000388
#> 5 0.1 1 FALSE FALSE 0.965 1.00 0.000000201
#> 6 0.1 1 FALSE FALSE 0.965 1.00 0.0000405
caret::confusionMatrix(best_svm$pred_value, best_svm$truth_value)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction FALSE TRUE
#> FALSE 61029 50
#> TRUE 190 1972
#>
#> Accuracy : 0.9962
#> 95% CI : (0.9957, 0.9967)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.9407
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.9969
#> Specificity : 0.9753
#> Pos Pred Value : 0.9992
#> Neg Pred Value : 0.9121
#> Prevalence : 0.9680
#> Detection Rate : 0.9650
#> Detection Prevalence : 0.9658
#> Balanced Accuracy : 0.9861
#>
#> 'Positive' Class : FALSE
#>
function_ROC_AUC(best_svm$yesProb, best_svm$truth_value, "ROC Training SVM")#> Error in prediction(fitted, truth): could not find function "prediction"
Using SVM with a threshold = 0.1 and a cost of 1.0:
Results of cross-validation on training data have been re-calculated since submission of Part One. For Part One I had used accuracy for tuning. In Part Two, I re-trained and used true positive rate for most of the parameter tuning.
| Model | Tuning | AUROC | Threshold | Accuracy | TPR | FPR | Precision |
| Log Reg | N/A | 0.9996 | 0.1 | 0.9957 | 0.9802 | 0.0038 | 0.896 |
| LDA | N/A | 0.994 | 0.1 | 0.9888 | 0.8827 | 0.0077 | 0.7909 |
| QDA | N/A | 0.997 | 0.1 | 0.9922 | 0.926 | 0.006 | 0.8441 |
| KNN | k = 27 | 0.9997 | 0.1 | 0.9948 | 0.9866 | 0.005 | 0.8689 |
| Pen Log Reg | alpha = 1, lambda = 5.630136e-05 | 0.9996 | 0.2 | 0.996 | 0.959 | 0.0028 | 0.918 |
| Random Forest | mtry = 4, ntree = 1000 | 0.994 | 0.1 | 0.9954 | 0.985 | 0.004 | 0.884 |
| SVM | Cost = 1.0 | 0.9995 | 0.1 | 0.9962 | 0.975 | 0.003 | 0.9121 |
Part One’s training conclusions were written based on tuning by accuracy on all models. Before submitting Part Two, I reworked much of the training to use true positive rate for parameter and threshold tuning.
Training Data Linearly Separates, but Model is More Than Just Blue
Visually, the training data linearly separates very well. Even the 5-class, un-transformed data set was visually separable in the 3-D R-G-B visualization of “Blue Tarps” vs. the other four classes. Collapsing the classes and transforming the variables to decrease variability of Red and Green further improved the linear separability. All the models used performed with 99% accuracy with cross validation using the training data. All except the Lasso model used the following formula:
\[ ClassBinary = Blue+Green+Red+GBSqr+RBSqr\] where \(GBSqr=(Green + Blue)^2\) and \(RBSqr=(Red + Blue)^2\)
The lasso model selected the formula: \[ClassBinary = Red + Green + Blue + (Red + Green)^2\]
The functions with the additive sqare terms performed better than the simplest model (ClassBinary = Blue), and slightly better than the basic additive model (ClassBinary = Blue + Red + Green).
Even though a pixel with only a blue tarp in it is easily identifiable, pixels partially representing blue tarps require more predictors to perform better than guessing.
I look forward to discovering if the hold-out/validation data set also separates as well.
Distribution of Classes and How It Affects Model Selection
Only 3% of the observations in the training data are labeled “Blue Tarp”. This is a very unbalanced training set. This is not unexpected because it would be extremely surprising if a high percentage of land area was covered by blue tarps. In fact, if that were the case, then I would expect it would not be challenging for aid workers to find survivors because survivors would be covering a high percentage of Haiti, or Haiti would just have blue tarps everywhere and they would be of little predictive power for finding survivors.
99% accuracy, while impressive sounding, needs to be considered within context. Accuracy is 97% if the model predicted “No” for every pixel; however no blue tarps would be identified and survivors would remain undiscovered by aid workers. The “best guess” scenario, with high accuracy, provides no value for the humans in our toy problem. 99% accuracy is better than guessing, and the closer to 100% accuracy we can get, the better.
Before selecting any model, I recommend working with experienced aid workers and local Haitian residents with knowledge of the land to determine the choice between, and balance of these considerations. However, in this case, if we want the model that is least likely to send aid workers to non-blue tarps we will select the tuned Logistic Regression Model WIth Sqr Terms; additionally if we want the model that correctly identified the greatest percentage of blue tarps we will also select the tuned Logistic Regression Model with Sqr Terms.
The Logistic Regression Model with Sqr Terms, while lower in Precision, performed better on other measures which are more appropriate for this problem.
If time is of the essence, KNN Took a Long Time to Tune on my Laptop: Of all the models, tuning KNN for both the highest accuracy “number of neighbors” and highest accuracy probability threshold took the most time for my laptop to complete. In fact, on my laptop, I started the threshold tuning loop and went and did chores while periodically checking for completion. I am assuming that in a real-world scenario, each night the model would be validated and tuned with additional labeled data to improve performance. Adding to this concern, I only tested neighbors in the range 1 to 51, and with 60,000+ observations a larger k may have performed better, but the time needed to test additional k tuning parameters was too great.
If time is of the essence, which in this scenario it would be because getting predictions for locations of survivors is only the first step in making a plan for volunteers to reach the survivors, it may make sense to select a model that does not take as long to produce a result. The KNN model performed almost as well as the Logistic Regression with the complex model when considering Negative Predictive Value, and False Negative Rate; however the Logistic Regression Model performed better on these two metrics. However, the KNN model performed better at AUROC, and Precision than the Logistic Regression model, but with a significant increase in time to compute.
Very small penalty for additional parameters:
The penalty (\(\lambda\)) that performed the best using elasticnet penalized regression was very small. The conclusion I draw is that while lasso performed the best as accuracy when comparing ridge, elasticnet, and lasso, the penalties for the additional predictors was very small and with such a small ratio of number of parameters to number of observations, it wasn’t significantly useful, in this case, to shrink the model.
Request For Additional Information for Real-World Implementation
I am interested in how to use limited resources to reach the greatest number of vulnerable people. Without the GIS information for each pixel I am unable to calculate locations where the most blue tarps will be found to able to help the most people; however, there may not be a 1:1 relationship between number of tarps and number of people, so resource allocation becomes even trickier. My conclusion is that without the added GIS information for the observations I cannot provide information to help an efficient distribution of aid to those affected by the earthquake. And, demographic information regarding population density, and members-per-household would be of priceless value. Additionally, GIS information for pre-earthquake roads, and other geographic features would be useful.
The following code is not evaluated in the knitted document, but is included here for completeness.
# directory
holdout_dir = "HoldOut"
# unzip
unzip(file.path(holdout_dir,"Hold+Out+Data.zip"), exdir = holdout_dir)read_lines(file.path(holdout_dir, "orthovnir057_ROI_NON_Blue_Tarps.txt"), n_max=12)df_NotBlue_057 = as.data.frame(read_table(file.path(holdout_dir, "orthovnir057_ROI_NON_Blue_Tarps.txt"),
skip=7, col_types = cols(';'="-", ID = "c"))) %>%
rename(Red = B1, Green = B2, Blue = B3) %>%
mutate(ClassBinary = '0') %>%
select(Lat, Lon, Red, Green, Blue, ClassBinary)
head(df_NotBlue_057)
dim(df_NotBlue_057)read_lines(file.path(holdout_dir, "orthovnir067_ROI_NOT_Blue_Tarps.txt"), n_max=12)df_NotBlue_067 = as.data.frame(read_table(file.path(holdout_dir, "orthovnir067_ROI_NOT_Blue_Tarps.txt"),
skip=7, col_types = cols(';'="-", ID = "c"))) %>%
rename(Red = B1, Green = B2, Blue = B3) %>%
mutate(ClassBinary = '0') %>%
select(Lat, Lon, Red, Green, Blue, ClassBinary)
head(df_NotBlue_067)
dim(df_NotBlue_067)read_lines(file.path(holdout_dir, "orthovnir069_ROI_NOT_Blue_Tarps.txt"), n_max=12)df_NotBlue_069 = as.data.frame(read_table(file.path(holdout_dir, "orthovnir069_ROI_NOT_Blue_Tarps.txt"),
skip=7, col_types = cols(';'="-", ID = "c"))) %>%
rename(Red = B1, Green = B2, Blue = B3) %>%
mutate(ClassBinary = '0') %>%
select(Lat, Lon, Red, Green, Blue, ClassBinary)
head(df_NotBlue_069)
dim(df_NotBlue_069)read_lines(file.path(holdout_dir, "orthovnir078_ROI_NON_Blue_Tarps.txt"), n_max=12)df_NotBlue_078 = as.data.frame(read_table(file.path(holdout_dir, "orthovnir078_ROI_NON_Blue_Tarps.txt"),
skip=7, col_types = cols(';'="-", ID = "c"))) %>%
rename(Red = B1, Green = B2, Blue = B3) %>%
mutate(ClassBinary = '0') %>%
select(Lat, Lon, Red, Green, Blue, ClassBinary)
head(df_NotBlue_078)
dim(df_NotBlue_078)read_lines(file.path(holdout_dir, "orthovnir067_ROI_Blue_Tarps.txt"), n_max=12)df_Blue_067 = as.data.frame(read_table(file.path(holdout_dir, "orthovnir067_ROI_Blue_Tarps.txt"),
skip=7, col_types = cols(';'="-", ID = "c"))) %>%
rename(Red = B1, Green = B2, Blue = B3) %>%
mutate(ClassBinary = '1') %>%
select(Lat, Lon, Red, Green, Blue, ClassBinary)#> Error in file.path(holdout_dir, "orthovnir067_ROI_Blue_Tarps.txt"): object 'holdout_dir' not found
head(df_Blue_067)#> Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'head': object 'df_Blue_067' not found
dim(df_Blue_067)#> Error in eval(expr, envir, enclos): object 'df_Blue_067' not found
File 2
read_lines(file.path(holdout_dir, "orthovnir069_ROI_Blue_Tarps.txt"), n_max=12)df_Blue_069 = as.data.frame(read_table(file.path(holdout_dir, "orthovnir069_ROI_Blue_Tarps.txt"),
skip=7, col_types = cols(';'="-", ID = "c"))) %>%
rename(Red = B1, Green = B2, Blue = B3) %>%
mutate(ClassBinary = '1') %>%
select(Lat, Lon, Red, Green, Blue, ClassBinary)
head(df_Blue_069)
dim(df_Blue_069)read_lines(file.path(holdout_dir, "orthovnir078_ROI_Blue_Tarps.txt"), n_max=12)df_Blue_078 = as.data.frame(read_table(file.path(holdout_dir, "orthovnir078_ROI_Blue_Tarps.txt"),
skip=7, col_types = cols(';'="-", ID = "c"))) %>%
rename(Red = B1, Green = B2, Blue = B3) %>%
mutate(ClassBinary = '1') %>%
select(Lat, Lon, Red, Green, Blue, ClassBinary)
head(df_Blue_078)
dim(df_Blue_078)# join the 7 dataframes of hold out data
df_HoldOut = dplyr::bind_rows(df_NotBlue_057, df_NotBlue_067, df_NotBlue_069, df_NotBlue_078, df_Blue_067, df_Blue_069, df_Blue_078)
dim(df_HoldOut)
head(df_HoldOut)library(readr)
write_csv(df_HoldOut, "holdout.csv")Confirm we have the entire training data set for training each model with the tuned parameters
haitiAllTraining = haitiBinaryRF %>%
mutate(ClassBinary = ifelse(ClassBinary == "No","0","1")) %>%
mutate(ClassBinaryYesNo = ifelse(ClassBinary == 0, "No","Yes")) %>% mutate(ClassBinaryYesNo = factor(ClassBinaryYesNo))
head(haitiAllTraining)#> # A tibble: 6 x 13
#> Class Red Green Blue GBSqr RBSqr ClassBinary RGSqr RGBSqr RG RB GB
#> <chr> <dbl> <dbl> <dbl> <I<d> <I<d> <chr> <I<d> <I<db> <dbl> <dbl> <dbl>
#> 1 Vege~ 64 67 50 13.7 13.0 0 17161 32761 4288 3200 3350
#> 2 Vege~ 64 67 50 13.7 13.0 0 17161 32761 4288 3200 3350
#> 3 Vege~ 64 66 49 13.2 12.8 0 16900 32041 4224 3136 3234
#> 4 Vege~ 75 82 53 18.2 16.4 0 24649 44100 6150 3975 4346
#> 5 Vege~ 74 82 54 18.5 16.4 0 24336 44100 6068 3996 4428
#> 6 Vege~ 72 76 52 16.4 15.4 0 21904 40000 5472 3744 3952
#> # ... with 1 more variable: ClassBinaryYesNo <fct>
filename_ho = 'holdout.csv'
haiti_ho <- read_csv(filename_ho)#>
#> -- Column specification --------------------------------------------------------
#> cols(
#> Lat = col_double(),
#> Lon = col_double(),
#> Red = col_double(),
#> Green = col_double(),
#> Blue = col_double(),
#> ClassBinary = col_double()
#> )
dim(haiti_ho)#> [1] 2004177 6
Next, I apply the transformations used on the testing data:
haiti_ho_full = haiti_ho %>%
mutate(GBSqr = I(((Green + Blue)^2) * .001), RBSqr = I(((Red + Blue)^2) * .001), ClassBinary = factor(ClassBinary)) %>%
mutate(RGSqr = I(((Red + Green)^2)), RGBSqr = I(((Red + Green + Blue)^2)), RG = (Red * Green), RB = (Red * Blue), GB = (Green * Blue)) %>%
mutate(ClassBinaryYesNo = ifelse(ClassBinary == 0, "No","Yes")) %>% mutate(ClassBinaryYesNo = factor(ClassBinaryYesNo)) %>%
mutate(ClassBinary = factor(ClassBinary))
head(haiti_ho_full) #> # A tibble: 6 x 14
#> Lat Lon Red Green Blue ClassBinary GBSqr RBSqr RGSqr RGBSqr RG RB
#> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <I<d> <I<d> <I<d> <I<db> <dbl> <dbl>
#> 1 18.5 -72.4 104 89 63 0 23.1 27.9 37249 65536 9256 6552
#> 2 18.5 -72.4 101 80 60 0 19.6 25.9 32761 58081 8080 6060
#> 3 18.5 -72.4 103 87 69 0 24.3 29.6 36100 67081 8961 7107
#> 4 18.5 -72.4 107 93 72 0 27.2 32.0 40000 73984 9951 7704
#> 5 18.5 -72.4 109 99 68 0 27.9 31.3 43264 76176 10791 7412
#> 6 18.5 -72.4 103 73 53 0 15.9 24.3 30976 52441 7519 5459
#> # ... with 2 more variables: GB <dbl>, ClassBinaryYesNo <fct>
haiti_ho_full %>%
group_by(ClassBinary) %>%
summarize(N = n()) %>%
mutate(Perc = round(N / sum(N), 2) * 100)#> # A tibble: 2 x 3
#> ClassBinary N Perc
#> * <fct> <int> <dbl>
#> 1 0 1989697 99
#> 2 1 14480 1
Our test data is only 1% Blue Tarps, which is a differente distribution from our training data.
The training-tuned threshold for the logistic regression model was 0.1. I will use this threshold when predicting the hold out data.
# train the selected logistic regression model with the complete training data set
glm.all.train = glm(ClassBinaryYesNo ~ Blue+Green+Red+GBSqr+RBSqr, binomial, data = haitiAllTraining)#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.preds = glm.all.train %>% augment(newdata=haiti_ho_full) %>%
mutate(.prediction=ifelse(1 - 1/(1 + exp(.fitted)) < .1, "0", "1")) %>%
mutate(.prediction = factor(.prediction))
caret::confusionMatrix(glm.preds$.prediction, haiti_ho_full$ClassBinary)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 1966055 89
#> 1 23642 14391
#>
#> Accuracy : 0.9882
#> 95% CI : (0.988, 0.9883)
#> No Information Rate : 0.9928
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.5433
#>
#> Mcnemar's Test P-Value : <2e-16
#>
#> Sensitivity : 0.9881
#> Specificity : 0.9939
#> Pos Pred Value : 1.0000
#> Neg Pred Value : 0.3784
#> Prevalence : 0.9928
#> Detection Rate : 0.9810
#> Detection Prevalence : 0.9810
#> Balanced Accuracy : 0.9910
#>
#> 'Positive' Class : 0
#>
For Logistic Regression, the hold-out test results with the training-set-tuned threshold of 0.1 are:
function_ROC_AUC(glm.preds$.fitted, glm.preds$ClassBinary, "ROC Hold-out Logistic Regression")#> [1] 0.9997198
The AUC for hold-out the test data using a logistic regression model is 0.9997.
# non fits the model on the entire training set
trctrl.lda.selected = trainControl(method = "none", summaryFunction=twoClassSummary, classProbs=T)
lda.cv.model.selected = train(ClassBinaryYesNo ~ Blue+Green+Red+GBSqr+RBSqr, data = haitiAllTraining, method = "lda", trControl=trctrl.lda.selected, tuneLength = 10)lda.test = predict(lda.cv.model.selected, newdata = haiti_ho_full, type="prob")lda.preds = lda.test %>%
mutate(.prediction=ifelse(Yes < .1, "0", "1")) %>%
mutate(.prediction = factor(.prediction))caret::confusionMatrix(lda.preds$.prediction, haiti_ho_full$ClassBinary)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 1953269 974
#> 1 36428 13506
#>
#> Accuracy : 0.9813
#> 95% CI : (0.9811, 0.9815)
#> No Information Rate : 0.9928
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.4128
#>
#> Mcnemar's Test P-Value : <2e-16
#>
#> Sensitivity : 0.9817
#> Specificity : 0.9327
#> Pos Pred Value : 0.9995
#> Neg Pred Value : 0.2705
#> Prevalence : 0.9928
#> Detection Rate : 0.9746
#> Detection Prevalence : 0.9751
#> Balanced Accuracy : 0.9572
#>
#> 'Positive' Class : 0
#>
For LDA, the hold-out test results with the tuned threshold of 0.1 are:
function_ROC_AUC(lda.test$Yes, haiti_ho_full$ClassBinary, "ROC Hold-out LDA")#> [1] 0.9966824
The AUC for hold-out the test data using an LDA is 0.997.
# fit the model on the entire training set
trctrl.qda.selected = trainControl(method = "none", summaryFunction=twoClassSummary, classProbs=T)
qda.cv.model.selected = train(ClassBinaryYesNo ~ Blue+Green+Red+GBSqr+RBSqr, data = haitiAllTraining, method = "qda", trControl=trctrl.qda.selected, tuneLength = 10)# test with holdout data
qda.test = predict(qda.cv.model.selected, newdata = haiti_ho_full, type="prob")
#head(qda.test)qda.preds = qda.test %>%
mutate(.prediction=ifelse(Yes < .1, "0", "1")) %>%
mutate(.prediction = factor(.prediction))
caret::confusionMatrix(qda.preds$.prediction, haiti_ho_full$ClassBinary)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 1950515 8264
#> 1 39182 6216
#>
#> Accuracy : 0.9763
#> 95% CI : (0.9761, 0.9765)
#> No Information Rate : 0.9928
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.1988
#>
#> Mcnemar's Test P-Value : <2e-16
#>
#> Sensitivity : 0.9803
#> Specificity : 0.4293
#> Pos Pred Value : 0.9958
#> Neg Pred Value : 0.1369
#> Prevalence : 0.9928
#> Detection Rate : 0.9732
#> Detection Prevalence : 0.9773
#> Balanced Accuracy : 0.7048
#>
#> 'Positive' Class : 0
#>
For QDA, the hold-out test results with the tuned threshold of 0.1 are:
function_ROC_AUC(qda.test$Yes, haiti_ho_full$ClassBinary, "ROC Hold-out QDA")#> [1] 0.6348543
The AUC for hold-out the test data using an QDA and a threshold of 0.1 is 0.635.
# fit the model on the entire training set
train_knn <- train(ClassBinaryYesNo ~ Blue+Green+Red+GBSqr+RBSqr, data = haitiAllTraining, method = "knn", tuneGrid = data.frame(k=27))
# test with the hold out data
knn.test = predict(train_knn, newdata = haiti_ho_full, type="prob")knn.preds = knn.test %>%
mutate(.prediction=ifelse(Yes < .1, "0", "1")) %>%
mutate(.prediction = factor(.prediction))
caret::confusionMatrix(knn.preds$.prediction, haiti_ho_full$ClassBinary)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 1956384 1380
#> 1 33313 13100
#>
#> Accuracy : 0.9827
#> 95% CI : (0.9825, 0.9829)
#> No Information Rate : 0.9928
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.4239
#>
#> Mcnemar's Test P-Value : <2e-16
#>
#> Sensitivity : 0.9833
#> Specificity : 0.9047
#> Pos Pred Value : 0.9993
#> Neg Pred Value : 0.2822
#> Prevalence : 0.9928
#> Detection Rate : 0.9762
#> Detection Prevalence : 0.9768
#> Balanced Accuracy : 0.9440
#>
#> 'Positive' Class : 0
#>
For KNN, the hold-out test results with the tuned threshold of 0.1 are:
function_ROC_AUC(knn.test$Yes, haiti_ho_full$ClassBinary, "ROC Hold-out KNN")#> [1] 0.9541119
The AUC for hold-out the test data using an KNN | k = 27 is 0.954.
I already have the matrix for the training data to retrain the model with the complete data set. However, I do need to create the testing matrix.
I tried using the lambda from the tuning, and it did not produce convergence on the test data.
x.haitiHoldout.test = model.matrix(frmla, data = haiti_ho_full)[,-1] # removing the intercept term from the formula
# switch ClassBinary to TRUE FALSE to facilitate the next bit of code
y.haitiHoldout.test = (haiti_ho_full %>% mutate(ClassBinary = ifelse(ClassBinary == '0', FALSE, TRUE)))$ClassBinaryTraining on entire training data set, and predict with the hold out data. Evaluation of the testing data is performed with the tuned lambda and the best threshold based on accuracy.
#- fit the lasso with all the training data and teh best lambda
lasso_holdout_model = glmnet(x.haitiFull.train, y.haitiFull.train, alpha = 1,
family="binomial")
# make test predictions
lasso.test = predict(lasso_holdout_model, x.haitiHoldout.test, s = 5.630136e-05, type = "response")
head(lasso.test)#> 1
#> 1 6.321608e-06
#> 2 1.004848e-05
#> 3 6.478616e-05
#> 4 4.688065e-05
#> 5 4.757437e-06
#> 6 1.555468e-06
lasso_out_confusion = tibble()
#- evaluate fold k
y_true = y.haitiHoldout.test
# set the threshold to p
thres = 0.2
y_hat = lasso.test > thres
#- output
lasso_out_confusion = bind_rows(lasso_out_confusion,
tibble(truth = y.haitiHoldout.test,
fitted = y_hat))
lasso_out_confusion = lasso_out_confusion %>% mutate(truth = factor(truth), fitted = factor(fitted))
caret::confusionMatrix(lasso_out_confusion$fitted, lasso_out_confusion$truth)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction FALSE TRUE
#> FALSE 1974381 100
#> TRUE 15316 14380
#>
#> Accuracy : 0.9923
#> 95% CI : (0.9922, 0.9924)
#> No Information Rate : 0.9928
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.6476
#>
#> Mcnemar's Test P-Value : <2e-16
#>
#> Sensitivity : 0.9923
#> Specificity : 0.9931
#> Pos Pred Value : 0.9999
#> Neg Pred Value : 0.4842
#> Prevalence : 0.9928
#> Detection Rate : 0.9851
#> Detection Prevalence : 0.9852
#> Balanced Accuracy : 0.9927
#>
#> 'Positive' Class : FALSE
#>
For Lasso, the hold-out test results with the tuned threshold of 0.2 are:
function_ROC_AUC(lasso.test, haiti_ho_full$ClassBinary, "ROC Hold-out Lasso")#> [1] 0.9997257
The AUC for hold-out the test data using an lasso with tuned lambda and threshold of 0.2 is 0.9997.
haitiAllTraining = haitiAllTraining %>% mutate(ClassBinary = factor(ClassBinary))
head(haitiAllTraining)#> # A tibble: 6 x 13
#> Class Red Green Blue GBSqr RBSqr ClassBinary RGSqr RGBSqr RG RB GB
#> <chr> <dbl> <dbl> <dbl> <I<d> <I<d> <fct> <I<d> <I<db> <dbl> <dbl> <dbl>
#> 1 Vege~ 64 67 50 13.7 13.0 0 17161 32761 4288 3200 3350
#> 2 Vege~ 64 67 50 13.7 13.0 0 17161 32761 4288 3200 3350
#> 3 Vege~ 64 66 49 13.2 12.8 0 16900 32041 4224 3136 3234
#> 4 Vege~ 75 82 53 18.2 16.4 0 24649 44100 6150 3975 4346
#> 5 Vege~ 74 82 54 18.5 16.4 0 24336 44100 6068 3996 4428
#> 6 Vege~ 72 76 52 16.4 15.4 0 21904 40000 5472 3744 3952
#> # ... with 1 more variable: ClassBinaryYesNo <fct>
#- fit the random forest with all the training data use the best tuning
# training
rf.haiti_holdout = randomForest(ClassBinary~Red+Blue+Green+GBSqr+RBSqr+RGSqr+RGBSqr+RG+RB+GB, data = haitiAllTraining, mtry=4, ntree=1000)
# test
rf.preds_holdout = predict(rf.haiti_holdout, newdata = haiti_ho_full, type="prob")
y_true = haiti_ho_full %>% mutate(ClassBinary = (ClassBinary == "1"))
y_true_val = y_true$ClassBinary
head(y_true_val)#> [1] FALSE FALSE FALSE FALSE FALSE FALSE
# set the threshold to p
y_hat = rf.preds_holdout[,2] > 0.1
#head(y_hat)
out_yhat_ytrue = tibble()
#- output
out_yhat_ytrue = bind_rows(out_yhat_ytrue,
tibble(thres = j, true = y_true_val, hat = y_hat, X0 = rf.preds_holdout[,1], X1 = rf.preds_holdout[,2]))
out_yhat_ytrue = out_yhat_ytrue %>% mutate(true = factor(true), hat = factor(hat))caret::confusionMatrix(out_yhat_ytrue$hat, out_yhat_ytrue$true)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction FALSE TRUE
#> FALSE 1957786 695
#> TRUE 31911 13785
#>
#> Accuracy : 0.9837
#> 95% CI : (0.9836, 0.9839)
#> No Information Rate : 0.9928
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.4521
#>
#> Mcnemar's Test P-Value : <2e-16
#>
#> Sensitivity : 0.9840
#> Specificity : 0.9520
#> Pos Pred Value : 0.9996
#> Neg Pred Value : 0.3017
#> Prevalence : 0.9928
#> Detection Rate : 0.9769
#> Detection Prevalence : 0.9772
#> Balanced Accuracy : 0.9680
#>
#> 'Positive' Class : FALSE
#>
For Random Forest using 1000 trees, and 4 predictors at each split, the hold-out test results with the tuned threshold of 0.1 are:
#head(out_yhat_ytrue)
function_ROC_AUC(out_yhat_ytrue$X1, out_yhat_ytrue$true, "ROC Hold-out Random Forest")#> [1] 0.976642
The AUC for the testing data on the tuned Random Forest model is 0.977.
svm_model_holdout = tune(svm, ClassBinary~Red+Blue+Green+GBSqr+RBSqr+RGSqr+RGBSqr+RG+RB+GB, data=haitiAllTraining, kernel="linear", ranges=list(cost=c(1)), probability=TRUE)
svm.preds_holdout = predict(svm_model_holdout$best.model, newdata = haiti_ho_full, probability=TRUE)y_hat = attr(svm.preds_holdout, "probabilities")[,"1"] > 0.1
out_svm_holdout = tibble()
# use y_true_val from random forest
out_svm_holdout = bind_rows(out_svm_holdout,
tibble(pred_value = y_hat, truth_value = haiti_ho_full$ClassBinary, noProb = attr(svm.preds_holdout, "probabilities")[,"0"], yesProb = attr(svm.preds_holdout, "probabilities")[,"1"]))
out_svm_holdout = out_svm_holdout %>% mutate(truth_value = (truth_value == 1), pred_value = factor(pred_value), truth_value = factor(truth_value))
#head(out_svm_holdout)caret::confusionMatrix(out_svm_holdout$pred_value, out_svm_holdout$truth_value)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction FALSE TRUE
#> FALSE 1968406 98
#> TRUE 21291 14382
#>
#> Accuracy : 0.9893
#> 95% CI : (0.9892, 0.9895)
#> No Information Rate : 0.9928
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.5691
#>
#> Mcnemar's Test P-Value : <2e-16
#>
#> Sensitivity : 0.9893
#> Specificity : 0.9932
#> Pos Pred Value : 1.0000
#> Neg Pred Value : 0.4032
#> Prevalence : 0.9928
#> Detection Rate : 0.9822
#> Detection Prevalence : 0.9822
#> Balanced Accuracy : 0.9913
#>
#> 'Positive' Class : FALSE
#>
For SVM using a linear kernel, with a cost == 1, the hold-out test results with the tuned threshold of 0.1 are:
function_ROC_AUC(out_svm_holdout$yesProb, out_svm_holdout$truth_value, "ROC Hold-out SVM")#> [1] 0.999491
| Model | Tuning | AUROC | Threshold | Accuracy | TPR | FPR | Precision |
| Log Reg | N/A | 0.9998 | 0.1 | 0.9882 | 0.994 | 0.012 | 0.38 |
| LDA | N/A | 0.997 | 0.1 | 0.9813 | 0.9327 | 0.018 | 0.2705 |
| QDA | N/A | 0.635 | 0.1 | 0.9763 | 0.43 | 0.019 | 0.1369 |
| KNN | k = 27 | 0.9861 | 0.1 | 0.9827 | 0.905 | 0.017 | 0.2822 |
| Penalized Log Reg | alpha = 1, lamda = 5.630136e-05 | 0.997 | 0.2 | 0.9923 | 0.993 | 0.008 | 0.4842 |
| Random Forest | ntree = 1000, mtry = 4 | 0.977 | 0.1 | 0.9838 | 0.952 | 0.015 | 0.3028 |
| SVM | linear kernel, cost == 1 | 0.9995 | 0.1 | 0.9816 | 0.994 | 0.018 | 0.2810 |
When using the training data, the KNN, Random Forest, and Logistic Regression models performed the best, and similarly with true positive rates of 98.66%, 98.5% and 98% respectively.
All three models were tuned for the best threshold using true positive rate, and all three had 0.1 as the best threshold.
KKN Model The KNN model was tuned for threshold and number of neighbors using true positive rate. The model used the additive model with the following features:The KNN model was tuned for number of neighbors using true positive rate to number of neighbors == 27.
Random Forest HERE IS WHERE I AM Features were added to the random forest because of a random forest ability to identify and used the following features:The random forest model was tuned using true positive rate to use 4 parameters at each split and 1000 trees.
The logistic regression model using an additive model of Blue, Green, Red, (Green + Blue)^2, and (Red + Blue)^2 as a threshold of 0.1 performed the best of the tested models. The logistic regression model performed best when comparing true positive rate, AUC, and accuracy.
Both the logistic regression and correctly classified 99.4% of the holdout data; however the logistic regression model classified fewer false positives (1.2% vs. 1.8%) which would result in fewer wasted resources.
TODO A discussion or analysis justifying why your findings above are compatible or reconcilable.
A recommendation and rationale regarding which algorithm to use for detection of blue tarps.
TODO A discussion of the relevance of the metrics calculated in the tables to this application context.
I would like to map all the locations that were correctly classified by the best performing model, i.e. the logistic regression model, and compare the true positives and false negatives. This visualization should allow us to determine if we should be concerned about missing blue tarps since we do not have a 100% true positive rate.
The logistic regression model performed the best based on True Positive Rate and False Positive Rate, so I will use the predictions of the hold out data from the logistic regression model using the threshold of 0.1.
We have 14,391 True Positive locations, and 89 False Negative locations.
#library(ggmap)
library(sf)#> Warning: package 'sf' was built under R version 4.0.5
#> Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1
library(mapview)#> Warning: package 'mapview' was built under R version 4.0.5
#> GDAL version >= 3.1.0 | setting mapviewOptions(fgb = TRUE)
library(leaflet)#> Warning: package 'leaflet' was built under R version 4.0.5
glm.preds = glm.preds %>% mutate(lat = Lat, lng = Lon)
#head(glm.preds)
true_pos_or_false_negs_preds = glm.preds %>%
filter((ClassBinary == 1 & .prediction == 1) | (ClassBinary == 1 & .prediction == 0))Below is a map of the True Positives and False Negatives. Based on inspecting this map, it appears the 14,391 true positive observations (shown in blue) correspond to approximately 28 locations of blue tarps in Dufour and Tesserot, and the false negative observations (shown in red, but you need to zoom in to see them) correspond to 3 areas on the edges of blue tarps that were correctly identified.
# Create a palette that maps factor levels to colors
pal <- colorFactor(c("red", "blue"), domain = c("0", "1"))
leaflet(true_pos_or_false_negs_preds) %>%
addTiles() %>%
addCircleMarkers(lng = ~lng, lat = ~lat, color = ~pal(.prediction))The leaflet library maps do not allow for zooming in at the individual pixel level of the dataset; however, the mapview library does. However, I was unable to get the mapview library to knit to html, so I include the code below and screen shots from example “blue tarps” in the dataframe containing labeled “True Positive” and “False Negative” results from the holdout data set. From this visualization you can see the that the false negatives are all from tarps where the majority of pixels were correctly predicted as true positives.
true_pos_or_false_negs_preds_sf = st_as_sf(true_pos_or_false_negs_preds, coords = c("lon", "lat"), crs = 4326)
mapview(true_pos_or_false_negs_preds_sf, zcol = ".prediction", burst = TRUE)All True Positive (yellow) and False Negative (purple) predictions
Example Tarp 1: True Positives (yellow) and False Negative (purple)
Example Tarp 2: True Positives (yellow) and False Negative (purple)
Example Tarp 3: True Positives (yellow) and False Negative (purple)
TODO A dicussion of processing speed.